"Principal Components" Enable A New Language of Images

Abstract

We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space, a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures that the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identify a semantic-spectrum coupling effect, the unwanted entanglement of high-level semantic content and low-level spectral details in the tokens, and resolve it by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and improves interpretability in a way that aligns with the human visual system. Moreover, auto-regressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference.
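The PCA analogy can be made concrete with a small sketch. The code below is ordinary linear PCA on synthetic data, not the tokenizer described here; the function names and the toy data are assumptions made for illustration. It shows the two properties the abstract refers to: explained variance that never increases from one component to the next, and a reconstruction error that shrinks as more leading components are kept.

```python
import numpy as np


def pca_explained_variance(x):
    """Return principal directions and explained-variance ratios.

    x has shape (n_samples, n_features).
    """
    centered = x - x.mean(axis=0)
    # SVD returns singular values in non-increasing order, so the
    # components come out sorted from most to least explained variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    return vt, variance / variance.sum()


def reconstruction_errors(x, components):
    """Mean squared error when keeping only the first k components, k = 1..K."""
    centered = x - x.mean(axis=0)
    errors = []
    for k in range(1, components.shape[0] + 1):
        basis = components[:k]                    # (k, n_features)
        approx = centered @ basis.T @ basis       # project onto first k directions
        errors.append(float(np.mean((centered - approx) ** 2)))
    return errors


# Toy data (hypothetical): 200 samples, 8 features with very different scales.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.3, 0.2, 0.1])
data = rng.normal(size=(200, 8)) * scales

components, ratios = pca_explained_variance(data)
errors = reconstruction_errors(data, components)

print("explained variance ratios:", np.round(ratios, 3))
print("reconstruction MSE by k:  ", np.round(errors, 4))
```

In this linear setting, truncating the sequence after the first k components gives the best rank-k approximation, which is the intuition behind using fewer tokens at a modest cost in fidelity.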
